
[BugFix] [Enhancement] Fix nullptr and support Iceberg null padding #49212

Merged
merged 1 commit into StarRocks:main on Aug 7, 2024

Conversation

@Samrose-Ahmed (Contributor) commented Jul 31, 2024

PR #48151 introduced a regression: what would previously return an Error now caused a nullptr dereference and crashed the entire CN. This change removes the crash and, for Iceberg, adds support for padding evolved fields with null values, as required by the Iceberg spec.

Why I'm doing:

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@@ -274,6 +275,14 @@ Status GroupReader::_create_column_reader(const GroupReaderParam::Column& column)
        RETURN_IF_ERROR(ColumnReader::create(_column_reader_opts, schema_node, column.slot_type(),
                                             column.t_iceberg_schema_field, &column_reader));
    }
    if (column_reader == nullptr) {
        if (column.t_iceberg_schema_field == nullptr) {
            return Status::InternalError("Invalid file: No valid column reader.");
Samrose-Ahmed (Contributor Author)

I combined the bugfix and the enhancement. Let me know if you have any comments on the null-padding behavior; otherwise I can split the PR and, at a minimum, return an error status so we don't segfault at e.g. column_reader->set_need_parse_levels.

Contributor

@Samrose-Ahmed maybe it would be better to split the bugfix and the enhancement. As far as I know, the Parquet reader only materializes a column that has at least one subfield; a padded column is padded at the scanner level. cc @Smith-Cruise

Samrose-Ahmed (Contributor Author)

Yes, it's for struct evolution in Iceberg. For example, if you had event: struct<action: string> and evolved the schema to event: struct<action: string, id: string>, then according to the Iceberg spec, selecting event.id must pad the old files that don't have the event.id column with null.
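
For illustration, here is a minimal self-contained C++ sketch of that padding rule; the types and names are hypothetical stand-ins, not StarRocks code:

#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row shape: a struct value is a map from subfield name to an
// optional string; std::nullopt models SQL NULL.
using StructRow = std::map<std::string, std::optional<std::string>>;

// Pad every projected subfield that is absent from an old file's rows.
std::vector<StructRow> pad_missing_subfields(std::vector<StructRow> file_rows,
                                             const std::vector<std::string>& projected_subfields) {
    for (auto& row : file_rows) {
        for (const auto& field : projected_subfields) {
            // e.g. "id" after evolving struct<action> to struct<action, id>:
            // absent from the old file, so it is read as NULL. emplace is a
            // no-op when the subfield already exists in the row.
            row.emplace(field, std::nullopt);
        }
    }
    return file_rows;
}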

Contributor

Yes. For this case, where we want to query event.id and the older files don't contain this subfield, I actually think we have covered it already: the column is treated as one that does not need to be materialized in the reader.

Samrose-Ahmed (Contributor Author)

I debugged it, and it crashes on one of our Iceberg tables. I added a unit test to replicate the crash.

@Smith-Cruise (Contributor)

Can you put the crash log here?

@Samrose-Ahmed (Contributor Author)

> Can you put the crash log here?

The crash is at this line when column_reader == nullptr:

column_reader->set_need_parse_levels(true);
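// column_reader is nullptr at this point, so this call dereferences a null pointer -> SIGSEGV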

Log:

main DEBUG (build 20cb672)
query_id:cf9af543-4f0e-11ef-ad60-3a56daa8801c, fragment_instance:cf9af543-4f0e-11ef-ad60-3a56daa8801e
Hive file path: /data/t8D0YA/ts_hour=2024-07-31-07/b2fd9c85-2c13-4545-b819-7fe66ca9566e.parquet, partition id: -1, length: 11181, offset: 4
tracker:process consumption: 477005088
tracker:query_pool consumption: 25234016
tracker:query_pool/connector_scan consumption: 117899264
tracker:load consumption: 0
tracker:metadata consumption: 0
tracker:tablet_metadata consumption: 0
tracker:rowset_metadata consumption: 0
tracker:segment_metadata consumption: 0
tracker:column_metadata consumption: 0
tracker:tablet_schema consumption: 0
tracker:segment_zonemap consumption: 0
tracker:short_key_index consumption: 0
tracker:column_zonemap_index consumption: 0
tracker:ordinal_index consumption: 0
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 0
tracker:jit_cache consumption: 0
tracker:update consumption: 0
tracker:chunk_allocator consumption: 0
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 3184743
tracker:replication consumption: 0
*** Aborted at 1722411050 (unix time) try "date -d @1722411050" if you are using GNU date ***
I20240731 00:30:50.066682 126608369256128 logconfig.cpp:131] Start to release memory of cache
I20240731 00:30:50.066693 126608369256128 logconfig.cpp:133] Release memory of cache success
I20240731 00:30:50.066959 126608369256128 logconfig.cpp:147] je_mallctl execute purge success
I20240731 00:30:50.069147 126608369256128 logconfig.cpp:155] je_mallctl execute dontdump success
PC: @         0x1163d5cd starrocks::parquet::GroupReader::_create_column_reader(starrocks::parquet::GroupReaderParam::Column const&)
*** SIGSEGV (@0x0) received by PID 47216 (TID 0x73264ee006c0) from PID 0; stack trace: ***
    @     0x732a5e4a1ec3 (/usr/lib/x86_64-linux-gnu/libc.so.6+0xa1ec2)
    @         0x149dfaf9 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
    @     0x732a5e445320 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x4531f)
    @         0x1163d5cd starrocks::parquet::GroupReader::_create_column_reader(starrocks::parquet::GroupReaderParam::Column const&)
    @         0x1163d287 starrocks::parquet::GroupReader::_init_column_readers()
    @         0x1163ac09 starrocks::parquet::GroupReader::init()
    @         0x115c3610 starrocks::parquet::FileReader::_init_group_readers()
    @         0x115be4a4 starrocks::parquet::FileReader::init(starrocks::HdfsScannerContext*)
    @         0x1131b55a starrocks::HdfsParquetScanner::do_open(starrocks::RuntimeState*)
    @         0x112f5137 starrocks::HdfsScanner::open(starrocks::RuntimeState*)
    @         0x11244574 starrocks::connector::HiveDataSource::_init_scanner(starrocks::RuntimeState*)
    @         0x1123d529 starrocks::connector::HiveDataSource::open(starrocks::RuntimeState*)
    @          0xc90edb0 starrocks::pipeline::ConnectorChunkSource::_open_data_source(starrocks::RuntimeState*, bool*)
    @          0xc90f18f starrocks::pipeline::ConnectorChunkSource::_read_chunk(starrocks::RuntimeState*, std::shared_ptr<starrocks::Chunk>*)
    @          0xd1c0d97 starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking(starrocks::RuntimeState*, unsigned long, starrocks::workgroup::WorkGroup const*)
    @          0xc8e1c18 auto starrocks::pipeline::ScanOperator::_trigger_next_scan(starrocks::RuntimeState*, int)::{lambda(auto:1&)#1}::operator()<starrocks::workgroup::YieldContext>(starrocks::workgroup::YieldContext&) const
    @          0xc8e4dd5 void std::__invoke_impl<void, starrocks::pipeline::ScanOperator::_trigger_next_scan(starrocks::RuntimeState*, int)::{lambda(auto:1&)#1}&, starrocks::workgroup::YieldContext&>(std::__invoke_other, starrocks::pipeline::ScanOperator::_trigger_next_scan(starro.
    @          0xc8e4cae std::enable_if<is_invocable_r_v<void, starrocks::pipeline::ScanOperator::_trigger_next_scan(starrocks::RuntimeState*, int)::{lambda(auto:1&)#1}&, starrocks::workgroup::YieldContext&>, void>::type std::__invoke_r<void, starrocks::pipeline::ScanOperator::_tr.
    @          0xc8e4ae2 std::_Function_handler<void (starrocks::workgroup::YieldContext&), starrocks::pipeline::ScanOperator::_trigger_next_scan(starrocks::RuntimeState*, int)::{lambda(auto:1&)#1}>::_M_invoke(std::_Any_data const&, starrocks::workgroup::YieldContext&)
    @          0xcb6c083 std::function<void (starrocks::workgroup::YieldContext&)>::operator()(starrocks::workgroup::YieldContext&) const
    @          0xcb6b56f starrocks::workgroup::ScanTask::run()
    @          0xcbe1b7f starrocks::workgroup::ScanExecutor::worker_thread()
    @          0xcbe182d starrocks::workgroup::ScanExecutor::initialize(int)::{lambda()#1}::operator()() const
    @          0xcbe2be6 void std::__invoke_impl<void, starrocks::workgroup::ScanExecutor::initialize(int)::{lambda()#1}&>(std::__invoke_other, starrocks::workgroup::ScanExecutor::initialize(int)::{lambda()#1}&)
    @          0xcbe28d2 std::enable_if<is_invocable_r_v<void, starrocks::workgroup::ScanExecutor::initialize(int)::{lambda()#1}&>, void>::type std::__invoke_r<void, starrocks::workgroup::ScanExecutor::initialize(int)::{lambda()#1}&>(starrocks::workgroup::ScanExecutor::initialize(.
    @          0xcbe2578 std::_Function_handler<void (), starrocks::workgroup::ScanExecutor::initialize(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
    @          0xb0e9a90 std::function<void ()>::operator()() const
    @          0xb76c6ba starrocks::FunctionRunnable::run()
    @          0xb76afb3 starrocks::ThreadPool::dispatch_thread()
    @          0xb779724 void std::__invoke_impl<void, void (starrocks::ThreadPool::*&)(), starrocks::ThreadPool*&>(std::__invoke_memfun_deref, void (starrocks::ThreadPool::*&)(), starrocks::ThreadPool*&)
    @          0xb778bf1 std::__invoke_result<void (starrocks::ThreadPool::*&)(), starrocks::ThreadPool*&>::type std::__invoke<void (starrocks::ThreadPool::*&)(), starrocks::ThreadPool*&>(void (starrocks::ThreadPool::*&)(), starrocks::ThreadPool*&)
    @          0xb777f55 void std::_Bind<void (starrocks::ThreadPool::*(starrocks::ThreadPool*))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>)
Segmentation fault (core dumped)

@Samrose-Ahmed force-pushed the iceberg-schema-evol-pad branch 2 times, most recently from 9245031 to 76ca9a0, on July 31, 2024 at 07:54
@zombee0 (Contributor) commented Aug 5, 2024

@Samrose-Ahmed I think the root cause is in IcebergMetaHelper::prepare_read_columns: we validate against the Iceberg schema, but we should validate against the materialized_columns schema.

if (!_is_valid_type(parquet_field, iceberg_it->second)) {
    continue;
}

@Smith-Cruise (Contributor)

IcebergMetaHelper::_is_valid_type() should check based on both TIcebergSchemaField and TypeDescriptor.

You can take a look at bool ParquetMetaHelper::_is_valid_type(); it checks by TypeDescriptor.
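
For a concrete picture of the dual check being suggested, here is a hedged, self-contained sketch; the struct definitions are simplified stand-ins for ParquetField, TIcebergSchemaField, and TypeDescriptor, and the compatibility rules are illustrative only:

#include <string>

// Simplified stand-ins; the real ParquetField / TIcebergSchemaField /
// TypeDescriptor carry much more information.
struct ParquetFieldInfo { int field_id; std::string physical_type; };
struct IcebergFieldInfo { int field_id; };
struct SlotTypeInfo { std::string logical_type; };

// A field is only readable from the file if it matches the Iceberg schema
// field (structural check, by field id) AND is compatible with the requested
// slot type (type check, as ParquetMetaHelper::_is_valid_type does by
// TypeDescriptor).
bool is_valid_type(const ParquetFieldInfo& pf, const IcebergFieldInfo& iceberg,
                   const SlotTypeInfo& slot) {
    if (pf.field_id != iceberg.field_id) {
        return false;  // not the same column after schema evolution
    }
    // Naive physical/logical compatibility rule, purely for illustration.
    if (slot.logical_type == "string") {
        return pf.physical_type == "BYTE_ARRAY";
    }
    return pf.physical_type == slot.logical_type;
}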

@Samrose-Ahmed (Contributor Author)

That actually makes sense; I'll update the PR.

@Samrose-Ahmed (Contributor Author)

I've updated the PR.

@Samrose-Ahmed force-pushed the iceberg-schema-evol-pad branch 2 times, most recently from cbb02fd to a7d3b29, on August 5, 2024 at 05:58
@zombee0 (Contributor) left a comment

LGTM. For clang-format, could you run ./clang-format.sh in /path/to/starrocks/code/build-support?

@@ -274,6 +274,10 @@ Status GroupReader::_create_column_reader(const GroupReaderParam::Column& column)
        RETURN_IF_ERROR(ColumnReader::create(_column_reader_opts, schema_node, column.slot_type(),
                                             column.t_iceberg_schema_field, &column_reader));
    }
    if (column_reader == nullptr) {
        // this shouldn't happen but guard
        return Status::InternalError("No valid column reader.");
Contributor

good job

@Samrose-Ahmed force-pushed the iceberg-schema-evol-pad branch 2 times, most recently from 43eb6fe to 2ec8151, on August 6, 2024 at 03:01
@Samrose-Ahmed (Contributor Author)

This approach doesn't seem to completely work.

I updated it to use field ids, but the new test (TestStructEvolutionPadNull) now fails with the same original issue: it returns 'No valid column reader' (before the check, it would crash).

The suggested change is fine for making the type check robust, but it doesn't address the root issue, which is that the nested subfield needs to be padded with a default value, and that doesn't happen with this code. As the code comment mentions, we put a nullptr column reader in children_reader and expect to append a default value for this subfield later, but it is never appended.

It seems like we do need the change I made earlier. Thoughts? @Smith-Cruise @zombee0
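
To make the missing padding step concrete, here is a hedged sketch of what appending NULLs for a subfield whose child reader is nullptr could look like; the types are hypothetical, not the actual StarRocks reader classes:

#include <cstddef>
#include <memory>
#include <vector>

// Illustrative column: a null mask is enough to show the padding.
struct NullableColumn {
    std::vector<bool> null_mask;  // true == NULL
    void append_nulls(std::size_t n) { null_mask.insert(null_mask.end(), n, true); }
};

struct ChildReader {
    virtual ~ChildReader() = default;
    virtual void read(std::size_t n, NullableColumn* out) = 0;
};

// Read one batch of a struct column. A child with no reader corresponds to a
// subfield missing from the old file: pad it with NULLs (the Iceberg rule)
// instead of dereferencing the null reader.
void read_struct_children(const std::vector<std::unique_ptr<ChildReader>>& children,
                          std::vector<NullableColumn>& columns, std::size_t num_rows) {
    for (std::size_t i = 0; i < children.size(); ++i) {
        if (children[i] == nullptr) {
            columns[i].append_nulls(num_rows);
        } else {
            children[i]->read(num_rows, &columns[i]);
        }
    }
}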

PR StarRocks#48151 introduced a regression, where what would previously return an
Error now caused a nullptr dereference and crashed the entire CN.
This change handles that case and returns nulls instead.

Signed-off-by: Samrose Ahmed <[email protected]>
@Smith-Cruise (Contributor)

You need to backport to branch-3.2 as well; please check it.

github-actions bot added the 3.2 label on Aug 7, 2024
github-actions bot commented Aug 7, 2024

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

github-actions bot commented Aug 7, 2024

[BE Incremental Coverage Report]

pass : 27 / 28 (96.43%)

file detail

path                                        covered_line  new_line  coverage  not_covered_line_detail
🔵 be/src/formats/parquet/group_reader.cpp  1             2         50.00%    [306]
🔵 be/src/formats/parquet/meta_helper.cpp   26            26        100.00%   []

@packy92 merged commit e224d5d into StarRocks:main on Aug 7, 2024
66 of 68 checks passed
github-actions bot commented Aug 7, 2024

@Mergifyio backport branch-3.3

github-actions bot removed the 3.3 label on Aug 7, 2024
github-actions bot commented Aug 7, 2024

@Mergifyio backport branch-3.2

github-actions bot removed the 3.2 label on Aug 7, 2024
mergify bot commented Aug 7, 2024

backport branch-3.3

✅ Backports have been created

mergify bot commented Aug 7, 2024

backport branch-3.2

✅ Backports have been created
